Abstract —Heterogeneous face recognition (HFR) refers to matching face images acquired from different sources (i.e., different sensors
or different wavelengths) for identification. HFR plays an important role in both biometrics research and industry. In spite of the promising
progress achieved in recent years, HFR remains a challenging problem because it is difficult to represent two heterogeneous images in
a homogeneous manner. Existing HFR methods either represent an image while ignoring its spatial information, or rely on a transformation
procedure that complicates the recognition task. Considering these problems, we propose a novel graphical representation based
HFR method (G-HFR) in this paper. Markov networks are employed to represent heterogeneous image patches separately, taking
the spatial compatibility between neighboring image patches into consideration. A coupled representation similarity metric (CRSM)
is designed to measure the similarity between the obtained graphical representations. Extensive experiments conducted on multiple
HFR scenarios (viewed sketch, forensic sketch, near infrared image, and thermal infrared image) show that the proposed method
outperforms state-of-the-art methods.

Index Terms —Heterogeneous face recognition, graphical representation, forensic sketch, infrared image, thermal image. 
1 Introduction

Face images captured through different sources, such
as sketch artists and infrared imaging devices, are
said to be in different modalities, i.e., they are
heterogeneous face images. Matching face images in
different modalities, which is referred to as
heterogeneous face recognition (HFR), is attracting
growing attention in both biometrics research and
industry. For instance, there are circumstances where
a photo of the suspect is not available and matching
sketches to a large-scale database of mug shots is
desired. Matching near infrared (NIR) images or thermal
infrared (TIR) images to visual (VIS) images is
important for biometric security control to handle
complicated illumination conditions.

Because of the great discrepancies between heterogeneous
face images, conventional homogeneous face recognition
methods perform poorly when directly identifying the
probe image (e.g. face sketch or infrared image) from
gallery images (e.g. face photos). Existing approaches
can be generally grouped into three categories:
synthesis-based methods, common space projection based
methods, and feature descriptor based methods.
Synthesis-based methods [1], [2], [3], [4], [5], [6]
first transform the heterogeneous face images into the
same modality (e.g. photo). Once the synthesized photos
are generated from non-photograph images or vice versa,
conventional face recognition algorithms can be applied
directly. However, the synthesis process is actually
more difficult than recognition and the performance of
these methods heavily depends on the fidelity of the
synthesized images. Common space projection based
methods [7], [8], [9], [10], [11], [12] attempt to
project face images in different modalities into a
common subspace where the discrepancy is minimized.
Then heterogeneous face images can be matched directly
in this common subspace. Yet the projection procedure
always causes information loss, which decreases the
recognition performance. Feature descriptor based
methods [13], [14], [15], [16], [17] first represent
face images with local feature descriptors. These
encoded descriptors can then be utilized for
recognition. However, most existing methods of this
category represent an image while ignoring the special
spatial structure of faces, which is crucial for face
recognition in practice.

This paper proposes a novel graphical representation
based HFR approach (G-HFR), which does not rely on
any synthesis or projection procedure but takes spatial
information into consideration. After face images are
divided into overlapping patches, Markov networks are
employed to model the relationship between homogeneous
image patches based on a representation dataset. The
representation dataset consists of a number of
heterogeneous face image pairs. Then the weight matrices
generated from the Markov networks are regarded as
graphical representations, which do not depend on the
modality. Therefore, the similarity between the weight
matrices of heterogeneous face images is used for
matching. Considering the spatial structure between
heterogeneous face image patches, a coupled
representation similarity metric (CRSM) is designed to
measure the similarity between their graphical
representations. Finally, the calculated similarity
scores between heterogeneous face images are used for
recognition.

The performance of the proposed G-HFR approach is
thoroughly validated on four HFR scenarios: the viewed
sketch database (the CUHK Face Sketch FERET Database
(CUFSF) [17]), the forensic sketch database (IIIT-D Sketch
Database [18], PRIP Viewed Software-Generated Composite
Database (PRIP-VSGC) [19], our collected forensic
sketch database), the near infrared database (the CASIA
NIR-VIS 2.0 Face Database [20]), and the thermal infrared
database (the Natural Visible and Infrared facial
Expression Database (USTC-NVIE) [21]). Experimental
results illustrate that the proposed approach achieves
superior performance in comparison to state-of-the-art
methods.

The main contributions of this paper are summarized 
as follows: 

1) We employ Markov networks to obtain graphical
representations of heterogeneous face images, which
is the first HFR representation to take spatial
information into consideration;

2) A coupled representation similarity metric is
developed for matching, which considers the spatial
structure between heterogeneous face image patches;

3) Leading accuracies are achieved on multiple HFR
scenarios, which illustrates the effectiveness of the
proposed method.

In this paper, except where noted, a bold lowercase
letter denotes a column vector and a bold uppercase
letter stands for a matrix. Regular lowercase and
uppercase letters denote scalars. The rest of this paper
is organized as follows. Section 2 reviews
representative HFR methods. Section 3 presents the
proposed graphical representation approach for HFR.
Section 4 shows the experimental results and analysis,
and the conclusion is drawn in Section 5.

2 Related Work 

In this section, we briefly review representative HFR
methods in the aforementioned three categories:
synthesis-based methods, common space projection based
methods, and feature descriptor based methods.

Synthesis-based HFR methods began with an
eigen-transformation algorithm [3] proposed by Tang and
Wang. Later, Liu et al. [2] proposed a locally linear
embedding approach for patch-based face sketch synthesis.
The sketch patches were synthesized independently and
the spatial compatibility between neighboring patches
was neglected. Chen et al. [22] proposed to learn the
locally linear mappings between NIR and VIS patches
in a similar manner to [2]. Gao et al. [1] employed an
embedded hidden Markov model to represent the nonlinear
relationship between sketches and photos, and a
selective ensemble strategy [23] was explored to
synthesize a sketch. Wang and Tang [5] proposed a
multi-scale Markov random field model for face
sketch-photo synthesis, which takes the spatial
constraints between neighboring patches into
consideration. Li et al. [6] proposed a learning-based
framework to synthesize photos from thermal infrared
images, and the Markov random field model was applied to
improve the synthesized result. Zhou et al. [24]
proposed a Markov weight field model which is capable of
synthesizing new patches that do not appear in the
training set. Wang et al. [4] presented a transductive
face sketch-photo synthesis method which incorporates
the test image into the learning process.

In order to minimize the intra-modality difference, Lin
and Tang [9] proposed a common discriminant feature
extraction (CDFE) approach to map heterogeneous features
into a common feature space. Canonical correlation
analysis (CCA) was applied to learn the correlation
between NIR and VIS face images by Yi et al. [12]. Lei
and Li [8] proposed a subspace learning framework for
heterogeneous face matching, which is called coupled
spectral regression (CSR). They later improved CSR
by learning the projections based on all samples from
all modalities [25]. Sharma and Jacobs [11] used partial
least squares (PLS) to linearly map images from different
modalities to a common linear subspace. A cross modal
metric learning (CMML) algorithm was proposed by
Mignon and Jurie [10] to learn a discriminative latent
space. Both the positive and negative constraints were
considered in the metric learning procedure. Kan et al. [7]
proposed a multi-view discriminant analysis (MvDA)
method to obtain a discriminant common space for
recognition. Both inter-view and intra-view correlations
were exploited.

A number of feature descriptor based HFR approaches
have shown promising performance. Klare et al. [16]
proposed a local feature-based discriminant analysis
(LFDA) framework based on scale invariant feature
transform (SIFT) features [26] and multiscale local
binary pattern (MLBP) features [27]. A face descriptor
based on coupled information-theoretic encoding was
designed for matching face sketches with photos by Zhang
et al. [17]. The coupled information-theoretic projection
tree was introduced and was further extended to a
randomized forest with different sampling patterns.
Another face descriptor called local radon binary pattern
(LRBP) was proposed in [13]. The face images were
projected onto the radon space and encoded by local
binary patterns (LBP). A histogram of averaged oriented
gradients (HAOG) face descriptor was proposed
to reduce the modality difference [14]. Lei et al. [28]
proposed a discriminant image filter learning method
that benefits from an LBP-like face representation for
matching NIR to VIS face images. Alex et al. [29]
proposed a local difference of Gaussian binary pattern
(LDoGBP) for face recognition across modalities.

With great progress achieved on viewed sketches,
research has recently begun to focus on matching forensic
sketches to mug shots. Klare et al. [16] matched forensic
sketches to mug shot photos with a populated gallery.
Bhatt et al. [30] proposed a discriminative approach
for matching forensic sketches to mug shots employing
the multi-scale circular Weber's local descriptor (MCWLD)
and an evolutionary memetic optimization algorithm.
Klare and Jain [15] represented heterogeneous face images
through their nonlinear kernel similarities to a
collection of prototype face images. Considering the
fact that many law enforcement agencies employ facial
composite software to create composite sketches, Han
et al. [31] proposed a component based approach for
matching composite sketches to mug shot photos.

3 Graphical Representation for Heterogeneous Face Recognition

In this section, we present a new approach for HFR.
Without loss of generality and for ease of presentation,
we take face sketch-photo recognition as an example to
describe the proposed method. A representation dataset
composed of face sketch-photo pairs is constructed at
the beginning, which is utilized to extract the graphical
representations of the gallery and probe images.
Considering a representation dataset with $M$ face
sketch-photo pairs $\{(\mathbf{s}^1, \mathbf{p}^1), \cdots, (\mathbf{s}^M, \mathbf{p}^M)\}$,
we first divide each face image into $N$ overlapping
patches. The probe sketch $\mathbf{t}$ and the gallery
photos $\{\mathbf{g}^1, \cdots, \mathbf{g}^L\}$ are also
divided into $N$ overlapping patches correspondingly.
Here $L$ denotes the number of photos in the gallery.
For a probe sketch patch $\mathbf{y}_i$ ($i = 1, 2, \cdots, N$),
we can find the $K$ nearest sketch patches from the
sketches in the representation dataset within the search
region around the location of $\mathbf{y}_i$. The probe
sketch patch $\mathbf{y}_i$ can then be regarded as a
linear combination of the $K$ nearest sketch patches
$\{\mathbf{y}_{i,1}, \cdots, \mathbf{y}_{i,K}\}$ weighted
by a column vector
$\mathbf{w}_{y_i} = (w_{y_i,1}, \cdots, w_{y_i,K})^T$.
The weight vector $\mathbf{w}_{y_i}$ is regarded as a
representation of the probe sketch patch $\mathbf{y}_i$.
For a gallery photo patch $\mathbf{x}_i^l$ from the $l$th
gallery photo $\mathbf{g}^l$, where $l = 1, 2, \cdots, L$,
we can also find the $K$ nearest photo patches from the
photos in the representation dataset and reconstruct the
photo patch by a linear combination of these $K$ nearest
photo patches weighted by $\mathbf{w}_{x_i^l}$. The
weight vector $\mathbf{w}_{x_i^l}$ is regarded as a
representation of the gallery photo patch
$\mathbf{x}_i^l$. The proposed approach is based on the
observation that two heterogeneous face image patches
corresponding to the same location from the same person
tend to have similar representations, and the
representations of two heterogeneous face image patches
from different persons usually differ greatly.
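
As a minimal sketch of the patch division described above, the following Python snippet enumerates the top-left corners of overlapping square patches and lists which patches are adjacent. The function names, the `overlap` parameter, and the 4-neighbor definition of adjacency are assumptions made for illustration; the text only states that patches overlap and that adjacent patches are related.

```python
def patch_origins(width, height, patch_size, overlap):
    """Return top-left (x, y) corners of overlapping square patches, row by row.

    `overlap` is the fraction of the patch side shared with the next patch
    (hypothetical parameter; 0.5 gives a stride of half a patch).
    """
    stride = round(patch_size * (1 - overlap))
    if stride <= 0:
        raise ValueError("overlap must be smaller than 1")
    xs = range(0, width - patch_size + 1, stride)
    ys = range(0, height - patch_size + 1, stride)
    return [(x, y) for y in ys for x in xs]


def adjacent_pairs(origins, stride):
    """Index pairs (i, j), i < j, of horizontally or vertically adjacent patches.

    Assumption: adjacency means the right and lower neighbor on the grid.
    """
    index = {xy: i for i, xy in enumerate(origins)}
    edges = []
    for (x, y), i in index.items():
        for neighbor in ((x + stride, y), (x, y + stride)):
            j = index.get(neighbor)
            if j is not None:
                edges.append((i, j))
    return edges


# Settings reported in Section 4.2: 100 x 125 image, 10 x 10 patches, 50% overlap
origins = patch_origins(100, 125, 10, 0.5)
edges = adjacent_pairs(origins, stride=5)
```

With these settings the grid has 19 × 24 = 456 positions, which agrees with the patch count quoted in Section 4.2.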

The reconstruction weights can be simply generated
through conventional subspace learning approaches
such as principal component analysis (PCA) [32] and
locally linear embedding (LLE) [33]. However, these
approaches neglect the spatial structure information,
which is essential for face recognition. To this end, we
propose to utilize Markov networks to represent
heterogeneous face image patches separately, which takes
full advantage of the spatial compatibility between
adjacent patches. Once graphical representations for
probe sketch patches and gallery photo patches are
obtained, a CRSM is designed to measure the similarity
between the probe sketch $\mathbf{t}$ and the gallery
photo $\mathbf{g}^l$. Figure 1 gives an overview
of the proposed method. The details are introduced as
follows.

3.1 Graphical Representation 

Inspired by the successful application of Markov
networks to synthesis scenarios [5], [24], we jointly
model all patches from a probe sketch or from a gallery
photo on Markov networks. The joint probability of the
probe sketch patches and the weights is defined as

$$p(\mathbf{w}_{y_1}, \cdots, \mathbf{w}_{y_N}, \mathbf{y}_1, \cdots, \mathbf{y}_N) = \prod_{i} \Phi\big(\mathbf{f}(\mathbf{y}_i), \mathbf{f}(\mathbf{w}_{y_i})\big) \prod_{(i,j) \in \Xi} \Psi\big(\mathbf{w}_{y_i}, \mathbf{w}_{y_j}\big) \tag{1}$$

where $(i,j) \in \Xi$ denotes that the $i$th probe sketch
patch and the $j$th probe sketch patch are adjacent.
$\Xi$ represents the edge set in the sketch layer of the
Markov networks. $\mathbf{f}(\mathbf{y}_i)$ means the
feature extracted from the probe sketch patch
$\mathbf{y}_i$ and $\mathbf{f}(\mathbf{w}_{y_i})$ denotes
the linear combination of features extracted from
neighboring sketch patches in the representation
dataset, i.e.
$\mathbf{f}(\mathbf{w}_{y_i}) = \sum_{k=1}^{K} w_{y_i,k} \mathbf{f}(\mathbf{y}_{i,k})$.
$\Phi(\mathbf{f}(\mathbf{y}_i), \mathbf{f}(\mathbf{w}_{y_i}))$
is the local evidence function, and
$\Psi(\mathbf{w}_{y_i}, \mathbf{w}_{y_j})$ is the
neighboring compatibility function.

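The local evidence function
$\Phi(\mathbf{f}(\mathbf{y}_i), \mathbf{f}(\mathbf{w}_{y_i}))$
is defined as

$$\Phi\big(\mathbf{f}(\mathbf{y}_i), \mathbf{f}(\mathbf{w}_{y_i})\big) \propto \exp\Big\{ -\Big\| \mathbf{f}(\mathbf{y}_i) - \sum_{k=1}^{K} w_{y_i,k} \mathbf{f}(\mathbf{y}_{i,k}) \Big\|^2 \Big/ 2\delta_D^2 \Big\} \tag{2}$$

The rationale behind the local evidence function is that
$\sum_{k=1}^{K} w_{y_i,k} \mathbf{f}(\mathbf{y}_{i,k})$
should be similar to $\mathbf{f}(\mathbf{y}_i)$. Then the
weight vector $\mathbf{w}_{y_i}$ is regarded as a
representation of the probe sketch patch $\mathbf{y}_i$.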

The neighboring compatibility function
$\Psi(\mathbf{w}_{y_i}, \mathbf{w}_{y_j})$ is defined as

$$\Psi\big(\mathbf{w}_{y_i}, \mathbf{w}_{y_j}\big) \propto \exp\Big\{ -\Big\| \sum_{k=1}^{K} w_{y_i,k} \mathbf{o}_{i,k}^{j} - \sum_{k=1}^{K} w_{y_j,k} \mathbf{o}_{j,k}^{i} \Big\|^2 \Big/ 2\delta_S^2 \Big\} \tag{3}$$

where $\mathbf{o}_{i,k}^{j}$ represents the vector
consisting of intensity values extracted from the
overlapping area (between the $i$th probe sketch patch
and the $j$th probe sketch patch) in the $k$th nearest
sketch patch of the $i$th probe sketch patch. The
neighboring compatibility function is utilized
to guarantee that neighboring patches have compatible
overlaps.

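Fig. 1. Overview of the proposed graphical representation based heterogeneous face recognition.

Maximizing the joint probability function (1), we can
obtain the optimal representations for the probe sketch.
By substituting equations (2) and (3) into equation (1),
maximizing the joint probability function (1) is
equivalent to the following minimization problem:

$$\min_{\mathbf{w}} \; \frac{1}{2\delta_S^2} \sum_{(i,j) \in \Xi} \Big\| \sum_{k=1}^{K} w_{y_i,k} \mathbf{o}_{i,k}^{j} - \sum_{k=1}^{K} w_{y_j,k} \mathbf{o}_{j,k}^{i} \Big\|^2 + \frac{1}{2\delta_D^2} \sum_{i=1}^{N} \Big\| \mathbf{f}(\mathbf{y}_i) - \sum_{k=1}^{K} w_{y_i,k} \mathbf{f}(\mathbf{y}_{i,k}) \Big\|^2 \tag{4}$$

$$\text{s.t.} \quad \sum_{k=1}^{K} w_{y_i,k} = 1, \quad 0 \le w_{y_i,k} \le 1, \quad i = 1, 2, \cdots, N, \; k = 1, 2, \cdots, K$$

where $\mathbf{w}$ is the concatenation of
$\{\mathbf{w}_{y_1}, \cdots, \mathbf{w}_{y_N}\}$ in a
long-vector form. Multiplying the objective by
$2\delta_D^2$, equation (4) can be further simplified as

$$\min_{\mathbf{w}} \; \alpha \sum_{(i,j) \in \Xi} \big\| \mathbf{O}_i^j \mathbf{w}_{y_i} - \mathbf{O}_j^i \mathbf{w}_{y_j} \big\|^2 + \sum_{i=1}^{N} \big\| \mathbf{f}(\mathbf{y}_i) - \mathbf{F}_i \mathbf{w}_{y_i} \big\|^2 \tag{5}$$

where $\alpha = \delta_D^2 / \delta_S^2$. $\mathbf{F}_i$
and $\mathbf{O}_i^j$ are two matrices, with the $k$th
column being $\mathbf{f}(\mathbf{y}_{i,k})$ and
$\mathbf{o}_{i,k}^{j}$, respectively. Equation
(5) can be rewritten as the following problem.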

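$$\min_{\mathbf{w}} \; \mathbf{w}^T \mathbf{Q} \mathbf{w} + \mathbf{w}^T \mathbf{c} + b \tag{6}$$

$$\text{s.t.} \quad \sum_{k=1}^{K} w_{y_i,k} = 1, \quad 0 \le w_{y_i,k} \le 1, \quad i = 1, 2, \cdots, N, \; k = 1, 2, \cdots, K$$

where

$$\mathbf{Q} = \alpha \sum_{(i,j) \in \Xi} (\mathbf{O}_i^j - \mathbf{O}_j^i)^T (\mathbf{O}_i^j - \mathbf{O}_j^i) + \sum_{i=1}^{N} \mathbf{F}_i^T \mathbf{F}_i$$

$$\mathbf{c} = -2 \sum_{i=1}^{N} \mathbf{F}_i^T \mathbf{f}(\mathbf{y}_i)$$

$$b = \sum_{i=1}^{N} \mathbf{f}^T(\mathbf{y}_i) \mathbf{f}(\mathbf{y}_i)$$

The bias term $b$ has no effect on the optimization
problem and we can ignore it. The problem in equation (6)
is optimized by the cascade decomposition method [24]
and then we obtain the weight matrix of the probe sketch
$\mathbf{W}_t = [\mathbf{w}_{y_1}, \cdots, \mathbf{w}_{y_N}]$.
The weight matrix
$\mathbf{W}_{g^l} = [\mathbf{w}_{x_1^l}, \cdots, \mathbf{w}_{x_N^l}]$
of the $l$th gallery photo $\mathbf{g}^l$ can be obtained
in a similar way by jointly modeling all the gallery
photo patches from $\mathbf{g}^l$ and the corresponding
neighboring photo patches in the representation dataset.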
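
To make the structure of problem (6) concrete, the sketch below assembles $\mathbf{Q}$ and $\mathbf{c}$ for the stacked weight vector and minimizes the objective on a toy example. Two points are assumptions rather than the paper's procedure: the block-matrix assembly is one reading of the compact formula for $\mathbf{Q}$, and the solver is plain projected gradient descent with a per-patch simplex projection, not the cascade decomposition method of [24]. All function names and the random toy data are hypothetical.

```python
import numpy as np


def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    k = np.arange(1, v.size + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1), 0.0)


def build_qp(F, f, O, edges, alpha):
    """Assemble Q and c of (6) for the stacked weight vector w (length N*K)."""
    N, K = len(F), F[0].shape[1]
    Q = np.zeros((N * K, N * K))
    c = np.zeros(N * K)
    for i in range(N):
        blk = slice(i * K, (i + 1) * K)
        Q[blk, blk] += F[i].T @ F[i]             # data term of patch i
        c[blk] = -2.0 * F[i].T @ f[i]
    for i, j in edges:
        A = np.zeros((O[(i, j)].shape[0], N * K))
        A[:, i * K:(i + 1) * K] = O[(i, j)]      # overlap pixels seen from patch i
        A[:, j * K:(j + 1) * K] = -O[(j, i)]     # same pixels seen from patch j
        Q += alpha * A.T @ A
    return Q, c


def solve_weights(Q, c, N, K, n_iter=500):
    """Projected gradient descent on w^T Q w + w^T c (assumed solver, not [24])."""
    w = np.full(N * K, 1.0 / K)                  # uniform start, as in Algorithm 1
    step = 1.0 / (2.0 * np.linalg.eigvalsh(Q)[-1] + 1e-12)
    for _ in range(n_iter):
        w = w - step * (2.0 * Q @ w + c)
        for i in range(N):
            blk = slice(i * K, (i + 1) * K)
            w[blk] = project_to_simplex(w[blk])  # enforce sum-to-one, non-negativity
    return w.reshape(N, K)


# Toy problem with random data (purely illustrative, not face features)
rng = np.random.default_rng(0)
N, K, d, m = 3, 4, 6, 5
F = [rng.normal(size=(d, K)) for _ in range(N)]
f = [rng.normal(size=d) for _ in range(N)]
edges = [(0, 1), (1, 2)]
O = {}
for i, j in edges:
    O[(i, j)] = rng.normal(size=(m, K))
    O[(j, i)] = rng.normal(size=(m, K))
Q, c = build_qp(F, f, O, edges, alpha=0.025)
W = solve_weights(Q, c, N, K)
```

Each row of `W` is one patch's weight vector; the rows stay on the simplex because the projection is applied after every gradient step.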

To match the representation $\mathbf{w}_{y_i}$ of a probe
sketch patch $\mathbf{y}_i$ to the representation
$\mathbf{w}_{x_i^l}$ of the gallery photo patch
$\mathbf{x}_i^l$, where $l = 1, 2, \cdots, L$, these
weight vectors are reformulated as $M$-dimensional
vectors (originally, these vectors are $K$-dimensional).
For ease of notation, the reformulated vectors are still
denoted as before. Each reformulated vector has at most
$K$ nonzero values. For example, $w_{y_i,z}$
($z = 1, 2, \cdots, M$) is nonzero only if the $i$th
patch extracted from the $z$th sketch in the
representation dataset is among the $K$ nearest neighbors
of the probe sketch patch $\mathbf{y}_i$.
3.2 Coupled Representation Similarity Metric

In order to measure the similarity between two
representations $\mathbf{W}_t$ and $\mathbf{W}_{g^l}$, we
calculate the similarity of each coupled patch pair
respectively. Here "coupled" means that the two column
vectors extracted from $\mathbf{W}_t$ and
$\mathbf{W}_{g^l}$ have the same column order. There are
many common metric functions to measure the similarity
between two vectors, such as the L1 norm, the L2 norm,
the L∞ norm, the cosine distance, and the chi-square
distance. However, these common metrics cannot fully
exploit the characteristics of the proposed graphical
representation, i.e. two graphical representations
corresponding to the same position in coupled
heterogeneous face images share similar semantic
meanings. For example, $w_{y_i,z}$ and $w_{x_i^l,z}$
represent the weights of the sketch patch and photo
patch from the $z$th ($z = 1, 2, \cdots, M$)
sketch-photo pair in the representation dataset. Here we
utilize the weights which share the same neighbors in
the graphical representations to describe the semantic
similarity. Inspired by the rank-based similarity
measure in [34], we propose a new similarity measure,
namely the coupled representation similarity metric
(CRSM), to cater for this principle.

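We compute the similarity score of the probe sketch
patch $\mathbf{y}_i$ and the gallery photo patch
$\mathbf{x}_i^l$ as the sum of the weights sharing the
same nearest neighbors.

$$s(\mathbf{y}_i, \mathbf{x}_i^l) = 0.5 \sum_{z=1}^{M} n_z \big( w_{y_i,z} + w_{x_i^l,z} \big) \tag{7}$$

where

$$n_z = \begin{cases} 1, & w_{y_i,z} > 0 \text{ and } w_{x_i^l,z} > 0 \\ 0, & \text{otherwise} \end{cases}$$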

The effect of the number of nearest neighbors $K$ on
the similarity measurement is shown in Figure 2. The
similarity map images of three sketch-photo pairs from
the CUFSF database are shown as examples. The first
two pairs are of the same person and the third pair is
of different persons. We have quantized the similarity
map images into binary images for better visualization,
where the bright area denotes that the corresponding
similarity score is larger than 0.5. We find that
similarity map images corresponding to heterogeneous
faces of the same person tend to have more bright areas
than those of different persons. Considering the
constraints $\sum_{z=1}^{M} w_{y_i,z} = 1$ and
$\sum_{z=1}^{M} w_{x_i^l,z} = 1$, the proposed similarity
measure ranges from 0 to 1.

The average of the similarity scores on all patch
positions can be regarded as the final similarity score
between the probe sketch and the gallery photo, which
is used for matching. In Figure 2, the numbers below
the similarity map images are the similarity scores
obtained. The proposed graphical representation for HFR
is summarized in Algorithm 1 below. It should be noted
that although the process described in Algorithm 1 does
not need statistical learning, the fusion of multiple
similarity metrics through statistical learning would
further improve the performance, which will be shown in
the experimental section.

[Figure 2 image not reproduced. Similarity scores printed below the similarity maps, six per sketch-photo pair:
pair 1: 0.1768 0.2350 0.2893 0.3448 0.3982 0.4464
pair 2: 0.2973 0.3614 0.4242 0.4782 0.5230 0.5753
pair 3: 0.1033 0.1484 0.1968 0.2382 0.2803 0.3288]

Fig. 2. Examples of the obtained similarity map images.
The left two columns show three sketch-photo pairs from
the CUFSF database. The first two pairs are of the same
person, and the third pair is of different persons. The
corresponding similarity map images are shown to the
right of the sketch-photo pairs. The size of the
similarity map image is the same as that of the face
image. We have quantized the similarity map images into
binary images for better visualization. The bright area
indicates that the corresponding similarity score is
larger than 0.5.

Algorithm 1. Graphical Representation for HFR (G-HFR)

1: Input: representation dataset
$\{(\mathbf{s}^1, \mathbf{p}^1), \cdots, (\mathbf{s}^M, \mathbf{p}^M)\}$,
a probe sketch $\mathbf{t}$, gallery photos
$\{\mathbf{g}^1, \cdots, \mathbf{g}^L\}$.

2: Initialize: $\mathbf{w}_{y_i} = [1/K, \cdots, 1/K]$,
$\mathbf{w}_{x_i^l} = [1/K, \cdots, 1/K]$,
$i = 1, \cdots, N$ and $l = 1, \cdots, L$;
divide face images into even overlapping patches.

3: Search the $K$ nearest neighbors for each probe sketch
patch and gallery photo patch respectively.

4: Solve the minimization problem (6) to compute the
graphical representations of the probe sketch
$\mathbf{t}$ and the gallery photos
$\{\mathbf{g}^1, \cdots, \mathbf{g}^L\}$ respectively.

5: Compute the similarity scores according to (7).

6: Output: the matched photo with the largest similarity
score.

4 Experiments

In this section, we evaluate the performance of the
proposed approach on four HFR scenarios (viewed
sketch, forensic sketch, near infrared image, and thermal
infrared image). We first evaluate the effectiveness of
the proposed graphical representation and the
effectiveness of CRSM separately. Then we investigate the
effect of different parameters and the number of features
on the recognition performance. Finally, we validate that
our approach achieves superior performance compared with
state-of-the-art methods on multiple heterogeneous face
databases.

[Figure 3 image not reproduced; panels (a) to (f).]

Fig. 3. Example images of heterogeneous faces tested in
this paper. (a) Viewed sketch-photo pair from the CUFSF
database. (b) Semi-forensic sketch-photo pair from the
IIIT-D sketch database. (c) Composite sketch-photo pair
from the PRIP-VSGC database. (d) Forensic sketch-photo
pair from our collected forensic sketch database.
(e) Near infrared image-photo pair from the CASIA
NIR-VIS 2.0 face database. (f) Thermal infrared
image-photo pair from the USTC-NVIE database.

4.1 Databases 

Four different HFR scenarios are tested in this section.
Example faces are shown in Figure 3. Note that all the
experiments are conducted by randomly partitioning each
dataset into the representation set, the training set,
and the test set. The accuracies reported in this paper
are statistical results over 10 random partitions.

4.1.1 Viewed Sketch Database 

The CUHK Face Sketch FERET Database (CUFSF) [17]
includes 1194 sketch-photo pairs with photos collected
from the FERET database [35]. The viewed sketches are
drawn by a sketch artist while viewing the photo
images. There are lighting variations in the photos and
shape exaggerations in the sketches of this database. On
the CUFSF database, 250 persons are randomly selected
as the representation dataset, and 250 persons are
randomly selected as the set for training classifiers
(namely the training set). The remaining 694 persons form
the testing set. Note that there is another viewed sketch
database, the CUHK face sketch database (CUFS) [5], on
which state-of-the-art methods including our method
achieve accuracies higher than 99% relatively easily.
Therefore, we skip the CUFS database in this paper.
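
The partition protocol for CUFSF can be sketched as follows. The set sizes (1194 pairs split into 250 / 250 / 694, repeated over 10 random partitions) come from the text; the function name, the use of Python's `random` module, and the seeds are assumptions, since the paper does not say how the random partitions were generated.

```python
import random


def random_partition(n_subjects, n_representation, n_training, seed):
    """Shuffle subject indices and split them into three disjoint sets."""
    ids = list(range(n_subjects))
    random.Random(seed).shuffle(ids)
    rep = ids[:n_representation]
    train = ids[n_representation:n_representation + n_training]
    test = ids[n_representation + n_training:]
    return rep, train, test


# CUFSF sizes from the text: 1194 pairs -> 250 / 250 / 694, 10 random partitions
partitions = [random_partition(1194, 250, 250, seed=s) for s in range(10)]
```

Reported accuracies would then be aggregated over the 10 entries of `partitions`.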

4.1.2 Forensic Sketch Databases 

We consider three types of forensic sketches in this
paper: semi-forensic sketches, composite sketches, and
forensic sketches. The IIIT-D Sketch Database [18]
contains 140 semi-forensic sketch-photo pairs with photos
collected from different sources. The semi-forensic
sketches are drawn based on the memory of the sketch
artist rather than by directly viewing the photo image.
The semi-forensic sketches can help bridge the gap
between viewed sketches and forensic sketches. On the
IIIT-D Sketch Database, the CUHK AR database [5]
including 123 sketch-photo pairs is chosen as the
representation dataset. We follow the partition protocol
in [36] and randomly select 124 semi-forensic
sketch-photo pairs for training the classifiers. Our
collected forensic sketch database, containing 168
real-world forensic sketches with corresponding mug shot
photos, is used for testing.

The PRIP Viewed Software-Generated Composite
Database (PRIP-VSGC) [19] contains 123 subjects,
with photos from the AR database [37] and composite
sketches created using FACES [38] and Identi-Kit [39].
The composite sketches are created with facial composite
software kits which synthesize a sketch by selecting a
collection of facial components from candidate patterns.
On the PRIP-VSGC database, we randomly select 123
sketch-photo pairs from the CUHK Student database
[5] to form the representation dataset. The classifiers are
trained on the CUFSF database here. The 123 composite
sketches generated using Identi-Kit¹ are used for testing.

Our collected forensic sketch database contains 168
real-world forensic sketches and corresponding mug shot
photos. The forensic sketches are drawn by sketch artists
from the descriptions of eyewitnesses or victims. This
database originates from a collection of images from the
forensic sketch artist Lois Gibson [40], the forensic sketch
artist Karen Taylor [41], and other internet sources. On
the forensic sketch database, the CUHK AR database
including 123 sketch-photo pairs is chosen as the
representation dataset. We follow the partition protocol
in [15] and 112 persons from the forensic sketch database
are randomly selected as the training set. The remaining
56 persons are used for testing.

4.1.3 Near Infrared Database 

The CASIA NIR-VIS 2.0 Face Database [20] contains
725 subjects, with near infrared images and photos
captured by NIR and VIS cameras respectively. The ages
of the subjects range from children to elderly
people. Different from some existing methods [12], [8],
[25] which benefit from multiple images per subject in
training and gallery, only one NIR and one VIS image
per subject are randomly selected in this paper to make
the scenario more difficult. Therefore, there are 725
near infrared image-photo pairs in total, of which 100
pairs are randomly selected as the representation
dataset. We randomly select 417 pairs to train the
classifiers and the remaining 208 pairs are used for
testing.

4.1.4 Thermal Infrared Database 

The Natural Visible and Infrared facial Expression
Database (USTC-NVIE) [21] contains 215 subjects, with
photos captured by a visible camera and thermal infrared
images captured by an infrared camera. There are
illumination and facial expression variations as well as
the disguise effect of glasses in this database.
Following the same strategy as for the near infrared
database above, we randomly select one TIR and one VIS
image per subject to make this scenario more difficult,
too. There are 129 thermal infrared image-photo pairs²
in total. On the thermal infrared database, 60 thermal
infrared image-photo pairs are randomly selected as the
representation dataset. We further randomly select 30
pairs to form the training set and the remaining 39
pairs are used for testing.

1. Currently only the 123 composite sketches generated using
Identi-Kit are available in the PRIP-VSGC database.

4.1.5 Enlarged Gallery 

A collection of 10,000 face photo images of 5,329 persons
was used to increase the scale of the gallery, which
mimics real-world face retrieval scenarios, e.g.
applications in law enforcement. The face photos in the
enlarged gallery set are collected from four databases:
the FERET database (2,722 photos) [35], the XM2VTS
database (1,180 photos) [42], the CAS-PEAL database
(3,098 photos) [43], and the labeled faces in the wild-a
(LFW-a) database (3,000 photos) [34]. The face images in
the first three databases are all captured under
controlled conditions and their quality is similar to
that of the gallery sets in this paper. In order to
increase the diversity of the enlarged gallery set, the
LFW-a database is also used to construct the enlarged
gallery set here. Experiments with an enlarged gallery
bring the results much closer to real-world HFR
scenarios.

4.2 Experimental Settings 

The parameters appeared in this paper are set as follows. 
A simple geometry alignment based upon five points 
(centers of two eyes, nose tip, left mouth corner, and 
right mouth corner) is performed on the face images 
used in this paper. These five facial points are auto¬ 
matically detected by the facial point detection method 
[44], and error points are corrected manually. The only 
exception is that the facial points of the thermal infrared 
images are manually located. Each face image is cropped 
to 100 x 125 based on the facial points. The image patch 
size is 10 x 10, and the overlapping area is 50%, i.e. there 
are Pm = 456 patches per image. The neighborhood 
search region is 16 x 16. In the Markov networks, we 
do not set 5$ or Sy directly, but instead a is set to 0.025, 
where a = 8%/8%. Three local descriptors, i.e., SURF [45], 
SIFT [26], and histograms of oriented gradients (HOG) 
[46], are used in this paper. Each local descriptor is 
extracted from image patches of size 10 × 10. For
SURF, we employ the implementation embedded in
MATLAB (available from version R2012b), using the
standard SURF-64 version. We manually set the center
of the image patch as the interest point. The default
parameter settings are used and a 64-dimensional vector
is returned as the SURF descriptor. For SIFT, we use an
open source library [47]. The center of the image patch
is taken as the interest point and we apply the default
parameter settings to obtain the standard 128-dimensional
vector. The HOG descriptor is also obtained through the
open source library [47]. The 10 × 10 image patch is taken
as the input and the cell size is set to 5. A 124-dimensional
vector is generated as the HOG descriptor. To determine
the other experimental settings, we conducted tuning
experiments on the CUFSF database. Once these
experimental settings are determined, they are kept
constant in the following experiments.

2. Due to the loss of some thermal and visible videos [21], only 129
subjects are available in the USTC-NVIE database.
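The patch count quoted in the settings can be checked with a short calculation. The sketch below assumes that a 50% overlap of a 10 × 10 patch corresponds to a stride of 5 pixels in both directions; the function names and the value of M are hypothetical and chosen only for illustration.

```python
def positions_per_axis(length, patch, stride):
    """Number of sliding-window positions along one image axis."""
    return (length - patch) // stride + 1


def count_patches(width, height, patch=10, overlap=0.5):
    """Total number of overlapping square patches in a width x height image."""
    stride = int(patch * (1 - overlap))  # 50% overlap of a 10-pixel patch -> stride 5
    return (positions_per_axis(width, patch, stride)
            * positions_per_axis(height, patch, stride))


patches_per_image = count_patches(100, 125)   # 19 columns * 24 rows
print(patches_per_image)                      # 456

# The graphical representation of one image has M weights per patch.
# M = 100 is a hypothetical representation-set size used only for illustration.
M = 100
print(M * patches_per_image)                  # 45600
```

With a 100 × 125 crop this gives 19 positions horizontally and 24 vertically, which agrees with the 456 patches per image stated in the settings.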

For the generation of the graphical representation, the
most time-consuming parts are the neighbor searching
phase and the optimization phase. For a given input probe
sketch patch, we first find the best matching patch from
each sketch in the representation dataset within the
search region. Then we select the K most similar sketch
patches as the candidates. The complexity of this process
is $O(P_c P_m M P_f)$. Here $P_c$ is the number of candidates
in the search region around one patch, $P_m$ is the number
of patches per image, $M$ is the number of face image
pairs in the representation dataset, and $P_f$ is the
dimensionality of the local descriptor. The optimization
phase mainly depends on the number of iterations. When
the iteration number is 20, it takes about 5 minutes to
obtain the graphical representation of an input probe
sketch from the CUFSF database. After being represented
by the proposed graphical representation, the weight
vector size of each image patch is $M$. Therefore, the
feature dimension of the graphical representation for each
image is $M P_m$. The complexity of the matching process
is $O(M P_m)$. In our experiments it takes about 4.2 ms
for one matching operation. All the experiments and
computations are conducted on an Intel Core i7-4790
3.60 GHz PC under the MATLAB R2012b environment.
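The neighbor searching phase for a single probe patch can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments: the Euclidean distance between descriptors, the function names, and the toy data are all assumptions. For one patch the loop touches every candidate of every representation image, which is where the $P_c M P_f$ factor of the stated complexity comes from; repeating it for all patches adds the factor $P_m$.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two descriptors of equal length P_f."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_candidates(probe, rep_candidates, k):
    """
    probe: descriptor of one probe patch (length P_f).
    rep_candidates: list of M lists; entry m holds the P_c descriptors found
        in the search region of representation image m.
    Returns the k (distance, image_index) pairs with the smallest distances.
    """
    best_per_image = []
    for m, candidates in enumerate(rep_candidates):
        # Best match inside the search region of image m.
        best = min(euclidean(probe, c) for c in candidates)
        best_per_image.append((best, m))
    # Keep the K most similar representation patches for this probe patch.
    return sorted(best_per_image)[:k]


# Toy example: M = 3 representation images, P_c = 2 candidates each, P_f = 2.
probe = [0.0, 0.0]
rep = [
    [[3.0, 4.0], [1.0, 0.0]],
    [[0.0, 0.5], [2.0, 2.0]],
    [[5.0, 5.0], [6.0, 6.0]],
]
print(select_candidates(probe, rep, k=2))  # [(0.5, 1), (1.0, 0)]
```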

To illustrate the effectiveness of the proposed graphical
representations, we first replace the Markov networks
with locally linear embedding [33], which ignores the
spatial information. To better demonstrate the
improvement brought by the neighboring compatibility
function in equation (1), we further conduct the
experiment without the compatibility function. Speeded up
robust features (SURF) [45] are utilized as the feature
descriptor and the number of nearest neighbors K is
set to 15. As shown in the top-left subfigure of Figure 4,
the spatial information is essential for HFR. By
considering the relationship between neighboring patches
(i.e., taking the compatibility function into consideration),
the proposed method achieves superior performance.

Fig. 4. Top-left subfigure shows the evaluation of the
necessity of spatial information; top-right subfigure shows
the comparison of the proposed CRSM with common
similarity metrics; bottom-left subfigure shows the
accuracies for different numbers of nearest neighbors K;
bottom-right subfigure shows the accuracies obtained by
fusion of similarity metrics. All four experiments are
conducted on the CUFSF database using the SURF feature.

Fig. 5. Experiments on using different features and the
fusion of them on the CUFSF database.

To justify and illustrate the effectiveness of the
proposed similarity metric (CRSM), we compare it with the
L1 norm, the L2 norm, the L∞ norm, the cosine distance,
and the chi-square distance. SURF is utilized as the feature
descriptor and K is set to 15. The L∞ norm is almost
invalid on the proposed graphical representations, with
a first match rate of 1.15%. The comparison of the
proposed similarity metric with the other common metric
functions is shown in the top-right subfigure of Figure
4. The L2 norm and the chi-square distance perform
poorly on the proposed graphical representations. This is
because these two metrics cannot exploit the
characteristics of the proposed graphical representation,
i.e., there are at most K nonzero values in the
M-dimensional vector, and the same positions of two
representation vectors in different images share similar
semantic meanings. The proposed similarity measure is
designed to cater for these characteristics and is therefore
more effective than the L1 norm and the cosine distance.
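For reference, the baseline metrics in this comparison can be written down for two sparse weight vectors. The sketch below does not implement CRSM, which is defined earlier in the paper; the chi-square form used here is one common variant and is an assumption, as are the function names and the toy vectors (M = 6 entries, K = 2 nonzero weights each).

```python
import math


def l1(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


def l2(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def linf(u, v):
    return max(abs(a - b) for a, b in zip(u, v))


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def chi_square(u, v):
    # One common form; terms with a + b == 0 are skipped to avoid division by zero.
    return sum((a - b) ** 2 / (a + b) for a, b in zip(u, v) if a + b > 0)


# Two toy weight vectors: at most K = 2 nonzero weights out of M = 6 positions.
u = [0.6, 0.4, 0.0, 0.0, 0.0, 0.0]
v = [0.5, 0.0, 0.5, 0.0, 0.0, 0.0]
for name, fn in [("L1", l1), ("L2", l2), ("Linf", linf),
                 ("cosine", cosine_distance), ("chi-square", chi_square)]:
    print(name, round(fn(u, v), 4))
```

Only the first position is nonzero in both vectors, so most of each distance comes from positions where one of the two weights is zero. This is the sparsity pattern the paragraph above refers to.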

We evaluate the effect of the number of nearest
neighbors K with SURF as the feature descriptor. K is set
to 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, and 80
respectively. As shown in the bottom-left subfigure of
Figure 4, the recognition accuracy varies with K, and
there is no smooth relationship between K and the
accuracy. This is likely due to the small number of
samples in the experiment. It motivates us to fuse
similarity metrics obtained with different K values, which
may improve the performance (this is confirmed in the
following experiments). Because a larger K requires more
memory, in the following experiments we simply set
K to 15, 20, 25, 30, 35, and 40, which is sufficient for
recognition performance.

In our experiments, we find that fusing the similarity
metrics corresponding to different K values further
improves the performance. We employ a linear one-class
support vector machine (SVM) to fuse the similarity
scores obtained with different K values. We follow the
fusion strategy in [17] and select all the intrapersonal
pairs and the same number of interpersonal pairs with
the largest similarity scores to train the one-class SVM.
As shown in the bottom-right subfigure of Figure 4,
increasing the number of similarity metrics does improve
the recognition accuracy. The rationale is that
complementary information exists among different
similarity metrics. Combining 6 similarity metrics
increases the accuracy from 92.22% to 94.24%.
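The pair selection step described above can be sketched as follows. This covers only the choice of training pairs, not the one-class SVM itself. It assumes a single similarity matrix is used to rank the interpersonal pairs, which the text does not specify; the function name and the toy scores are hypothetical.

```python
def select_training_pairs(scores, probe_ids, gallery_ids):
    """
    scores[i][j]: similarity between probe i and gallery j (larger = more similar).
    Returns (intra, inter): all same-identity pairs, plus an equal number of
    different-identity pairs with the largest similarity scores.
    Each pair is stored as (score, probe_index, gallery_index).
    """
    intra, inter = [], []
    for i, pid in enumerate(probe_ids):
        for j, gid in enumerate(gallery_ids):
            pair = (scores[i][j], i, j)
            if pid == gid:
                intra.append(pair)
            else:
                inter.append(pair)
    inter.sort(reverse=True)  # most similar interpersonal pairs first
    return intra, inter[:len(intra)]


# Toy example with three identities.
scores = [
    [0.9, 0.2, 0.4],
    [0.3, 0.8, 0.7],
    [0.1, 0.6, 0.5],
]
intra, inter = select_training_pairs(scores, [0, 1, 2], [0, 1, 2])
print(intra)  # [(0.9, 0, 0), (0.8, 1, 1), (0.5, 2, 2)]
print(inter)  # [(0.7, 1, 2), (0.6, 2, 1), (0.4, 0, 2)]
```

Keeping only the highest-scoring interpersonal pairs balances the two classes and concentrates training on the impostor pairs that are hardest to separate.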

We also investigate the effect of fusing different
features on the recognition performance. Because
the proposed method represents the heterogeneous face
images in each modality separately, common features
used in homogeneous face recognition are sufficient
for the task. In this paper, SURF [45], SIFT [26], and
HOG [46] are each employed to represent an image patch.
For each local descriptor, multiple graphical
representations can be generated with multiple K values.
The graphical representations obtained from the
three descriptors are then fused through the one-class
SVM, following the same strategy as in [17]. Note that
there are many other features which can also be used
in the proposed method. However, since this paper
mainly focuses on investigating the performance under
the graphical representation framework, the selection
of different types of features is beyond the scope of this
work. Figure 5 shows that fusing the models obtained from
the three features further improves the accuracy
from 94.24% (SURF), 89.48% (SIFT), and 89.05% (HOG)
to 96.04%. This validates that fusing the similarity
metrics obtained with different features boosts the
performance.

In the following experiments, G-HFR extracts the three
aforementioned features, and 6 similarity metrics are
calculated for each feature (corresponding to K =
15, 20, 25, 30, 35, and 40 respectively). These 18 metrics
are fused by the one-class SVM for the final recognition
task, except where noted.

Method         Accuracy     Method        Accuracy
TFSPS [4]      72.62%       PLS [11]      51%
MvDA [7]       55.50%       LRBP [13]     91.12%
LDoGBP [29]    91.04%       G-HFR         96.04%

TABLE 1

Rank-1 recognition accuracies of the state-of-the-art
approaches and our method on the CUFSF database.


4.3 Experiments on the Viewed Sketch Database 

We compare the proposed G-HFR method with
state-of-the-art approaches on the CUFSF database as
shown in Table 1. For the transductive synthesis method
(TFSPS) [4], query sketches are transformed into
synthesized photos, and random sampling LDA (RS-LDA)
[48] is used to match the synthesized photos to gallery
photos. Because the photos and sketches in CUFSF involve
lighting variations and shape exaggerations, the
synthesized photos have artifacts such as distortions.
These artifacts degrade the performance of face
recognition. For the common space projection based
approaches PLS [11] and MvDA [7], a discriminant common
space for the two modalities is learnt. Although these two
approaches generalize well and can be applied to various
heterogeneous scenarios, they perform poorly on
CUFSF as shown in Table 1. For the feature descriptor
based methods LRBP [13] and LDoGBP [29], feature
descriptors which are invariant to different modalities are
designed and used for recognition. These two approaches
achieve good performance with accuracies of 91.12% and
91.04% respectively. However, these features ignore the
spatial structure of faces. Our proposed method achieves a
first match rate of (96.04±0.0076)% with 95% confidence
interval and a tenth match rate of (99.86±0.0088)%
with 95% confidence interval. Zhang et al. [17] achieved
a verification rate (VR) of 98.70% at a 0.1% false
acceptance rate (FAR), compared with 99.14% VR at 0.1%
FAR for our proposed G-HFR method.
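The first match rate and tenth match rate quoted above are points on the cumulative match curve. A minimal sketch of how such rank-k accuracies can be computed from a probe-by-gallery similarity matrix is given below. The function names and toy scores are hypothetical, and ties are resolved in favor of the true match, which is an assumption.

```python
def rank_of_true_match(scores_row, true_index):
    """1-based rank of the true gallery entry when sorted by descending similarity."""
    true_score = scores_row[true_index]
    # Count gallery entries that score strictly higher than the true match.
    return 1 + sum(1 for s in scores_row if s > true_score)


def cumulative_match_scores(scores, true_indices, max_rank):
    """Fraction of probes whose true match appears within rank 1..max_rank."""
    ranks = [rank_of_true_match(row, t) for row, t in zip(scores, true_indices)]
    n = len(ranks)
    return [sum(1 for r in ranks if r <= k) / n for k in range(1, max_rank + 1)]


# Toy example: 3 probes, 3 gallery images, probe i matches gallery i.
scores = [
    [0.9, 0.1, 0.3],   # true match ranked 1st
    [0.2, 0.4, 0.8],   # true match ranked 2nd
    [0.5, 0.6, 0.7],   # true match ranked 1st
]
cmc = cumulative_match_scores(scores, [0, 1, 2], max_rank=3)
print(cmc[0])  # rank-1 accuracy: 2 of 3 probes
print(cmc[1])  # rank-2 accuracy: 1.0
```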

4.4 Experiments on the Forensic Sketch Databases 

Matching forensic sketches to mug shots is much more
difficult than matching the aforementioned viewed
sketches, because forensic sketches are drawn based on
the eyewitness's descriptions. These can easily be affected
by the eyewitnesses' face perceptions and the sketch
artists' perceptual experiences when drawing the forensic
sketches. It is even harder when the eyewitness's
description is subject to verbal overshadowing and memory
distortion. The rank-50 accuracies of the state-of-the-art
methods and the rank-50 accuracies with 95% confidence
intervals of the proposed G-HFR method on the three
types of forensic sketch databases are shown in Table 2.

We first compare the recognition performance of the
proposed G-HFR method with the method in [36] on
the IIIT-D Sketch Database. Considering the great
differences between viewed sketches and forensic sketches,
Bhatt et al. [36] proposed to conduct the training procedure
on semi-forensic sketches and achieved better
performance than the algorithm trained on viewed sketches.
They encoded discriminating information from local
regions using the multiscale circular Weber's local
descriptor (MCWLD) and applied an evolutionary memetic
optimization algorithm. The MCWLD method utilizes
140 semi-forensic sketches for training, and 190 forensic
sketches are taken as the probe images. 599 face photos
plus 6,324 photos form the gallery. A rank-50 accuracy
of 28.52% is achieved by this method. We follow the
same partition protocol by randomly selecting 124
semi-forensic sketches for training, and 168 forensic
sketches are taken as the probe set. The gallery is composed
of 168 mug shot photos and 10,000 photos from the enlarged
gallery set. Our method achieves a rank-50 accuracy of
(30.36±0.07)% with 95% confidence interval. To better
illustrate the performance of our method, we further
introduce two baseline methods (PCA [32] and Fisherface
[49]) in this paper, which achieve rank-50 accuracies of
(10.71±0.07)% and (10.71±0.09)% respectively with 95%
confidence interval on the IIIT-D Sketch Database. Figure
6 presents a visual comparison of cumulative match
scores and shows that our method achieves superior
performance at different ranks on the IIIT-D Sketch
Database.

Fig. 6. Cumulative match score comparison of the
baseline methods, the MCWLD method, and our method on
the IIIT-D Sketch Database.

Database                    Method                  Accuracy
IIIT-D sketch database      MCWLD [36]              28.52%
                            G-HFR                   (30.36±0.07)%
PRIP-VSGC database          Component-based [31]    <5%
                            G-HFR                   (51.22±0)%
Forensic sketch database    P-RS [15]               20.80%
                            G-HFR                   (31.96±0.41)%

TABLE 2

Rank-50 recognition accuracies of the state-of-the-art methods and rank-50 accuracies with 95% confidence
intervals of the proposed G-HFR method on three types of forensic sketch databases.

We next conduct experiments on the PRIP-VSGC
database. The composite sketches are generated with
each component approximated by the most similar
component available in the composite software's database.
Han et al. [31] proposed a component-based approach
using 123 composite sketches as the probe set and
123 photos from the AR database [37] together with
10,000 mug shots as the gallery. Klum et al. [19]
recently proposed the FaceSketchID System to match
facial composites with mug shots. Both the holistic and
component-based algorithms in the FaceSketchID
System were trained on viewed sketches and the match
scores were fused to improve the performance. Because
only the 123 composite sketches generated using
Identi-Kit are available in the PRIP-VSGC database, our
method is evaluated on these composite sketches
following the same protocol as [31]. The component-based
approach reported results on matching different
facial components of the composite sketches generated
by Identi-Kit, and all the rank-50 accuracies were lower
than 5% in [31]. Our method achieves a rank-50 accuracy
of (51.22±0)%. Because the training and test sets are fixed
on the PRIP-VSGC database, the standard deviation and
the 95% confidence interval are 0 on this composite sketch
database. The comparison of cumulative match scores
with the baseline methods is shown in Figure 7.

Fig. 7. Cumulative match score comparison of the
baseline methods and our method on the PRIP-VSGC
Database.
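The reason a fixed split yields a zero-width interval can be illustrated with a small calculation. The text does not state how the 95% confidence intervals are computed, so the normal-approximation formula below (mean ± 1.96 · s / √n over repeated runs) is an assumption, as are the function name and the number of runs.

```python
import math
import statistics


def mean_and_ci95(accuracies):
    """Mean and half-width of a normal-approximation 95% confidence interval."""
    n = len(accuracies)
    mean = statistics.mean(accuracies)
    if n < 2:
        return mean, 0.0
    std = statistics.stdev(accuracies)  # sample standard deviation
    return mean, 1.96 * std / math.sqrt(n)


# Fixed training and test sets: every run gives the same accuracy,
# so the standard deviation and the interval half-width are both 0.
mean, half_width = mean_and_ci95([51.22] * 10)
print(half_width)  # 0.0

# Random splits: the accuracies differ, so the interval has nonzero width.
# These three values are made up for illustration.
mean_r, half_width_r = mean_and_ci95([1.0, 2.0, 3.0])
print(mean_r)  # 2.0
```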

We finally conduct experiments on matching real-world
forensic sketches with mug shot photos. The prototype
random subspaces (P-RS) method [15] proposed
by Klare et al. applied three different image filters and
two different local feature descriptors to the probe and
gallery images. A set of prototypes representing both
the probe and gallery modalities is used for training
and a random subspace framework is employed to boost
the performance. They utilized 106 subjects for training
and 53 subjects plus 10,000 mug shots for testing
and achieved a rank-50 accuracy of 20.80%. We follow
the same partition protocol as in [15], and 112 persons
randomly selected from the forensic sketch database are
taken as the training set. The remaining 56 persons
are used for testing. The gallery set is enlarged by 10,000
photos from the enlarged gallery set. Our G-HFR method
achieves a rank-50 accuracy of (31.96±0.41)% with 95%
confidence interval, which outperforms the
state-of-the-art method [15]. The cumulative match scores
of the proposed method, the baseline methods, and the P-RS
method [15] are shown in Figure 8. Due to the small
scale of the available forensic sketch database, there are
not enough sketches for training a strong model. It is
reasonable to believe that the recognition performance
can be further improved with more forensic sketches
available.

Fig. 8. Cumulative match score comparison of the
baseline methods, the P-RS method, and our method on the
forensic sketch database.

4.5 Experiments on the Near Infrared Database 

We perform near infrared image to photo matching
on the CASIA NIR-VIS 2.0 Face Database [20], which is
a newly constructed, challenging, and practical database.
There are 725 subjects with 17,850 NIR and VIS images in
this database. Existing NIR-VIS matching methods were
trained with multiple images per subject. Motivated by
[15], the proposed method is trained with only one
NIR-VIS pair per subject. Experiments using a smaller
training set help demonstrate the value of our method.
The single NIR and VIS images per subject are randomly
selected. With 100 NIR-VIS pairs taken as the
representation dataset, 417 NIR-VIS pairs are randomly
selected as the training set and the remaining 208 pairs are
used for testing. The gallery is enlarged by 10,000 photos
from the enlarged gallery set to mimic the real-world face
retrieval scenario. The proposed method achieves rank-1
and rank-50 accuracies of (54.90±0.30)% and (83.32±0.23)%
respectively with 95% confidence intervals. Because this
is a new database, we only compare our method with
the baselines. The cumulative match score comparison is
shown in Figure 9.

Fig. 9. Cumulative match score comparison of the
baseline methods and our method on the CASIA NIR-VIS 2.0
Face Database.

We further conduct experiments on the CASIA NIR-VIS
2.0 face database by following the standard evaluation
protocols provided in [20]. We skipped tuning the
parameters on View 1 and kept the parameters the
same as in the experimental settings section. We then
randomly selected 150 persons from the training set in
each sub-experiment of View 2 as the representation
dataset. The remaining NIR-VIS pairs in the training set
are used for training. The testing images are used for
testing following the standard evaluation protocols. The
proposed method achieves a rank-1 accuracy of
(85.30±0.03)% with 95% confidence interval. A dense SIFT
with subspace LDA method proposed in [50] achieved a
rank-1 accuracy of 73.28%. Yi et al. [51] utilized restricted
Boltzmann machines (RBM) to learn a shared
representation for HFR and they reported an accuracy of
84.22% by introducing the RBM and an accuracy of 86.16%
after removing the first 11 principal components of PCA.

4.6 Experiments on the Thermal Infrared Database 

We perform thermal infrared image to photo matching
on the USTC-NVIE database [21]. We randomly select
one TIR image and one VIS image per subject, giving
129 TIR-VIS pairs in total. 60 TIR-VIS pairs
are randomly selected as the representation dataset. We
further randomly select 30 pairs to form the training set
and the remaining 39 pairs are used for testing. The gallery
is enlarged by 10,000 photos from the enlarged gallery set
to make this scenario more realistic. The illumination
and facial expression variations and the disguise effect of
glasses make this database very challenging. The PCA
method achieves rank-1 and rank-50 accuracies of
(0±0)% in both cases with 95% confidence intervals, and
the Fisherface method achieves (8.72±1.09)% and
(36.15±2.50)% respectively. Our method achieves rank-1
and rank-50 accuracies of (77.44±2.17)% and (95.38±0.91)%
respectively with 95% confidence intervals. The cumulative
match score comparison is shown in Figure 10 and our
method achieves excellent performance in this scenario.
To our knowledge, there are two methods performing
recognition between TIR and VIS images. The
synthesis-based TIR-VIS matching method [6] was
evaluated with only 47 subjects in the gallery and achieved
a rank-1 accuracy of 50.06%. The P-RS method [15]
conducted TIR-VIS matching on a gallery of 10,333
subjects, with 667 subjects for training and 333 subjects
for testing. They achieved a rank-1 accuracy of 46.7%.

Fig. 10. Cumulative match score comparison of the
baseline methods and our method on the USTC-NVIE
database.
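The random partition used in the one-pair-per-subject experiments can be sketched as below, using the USTC-NVIE split sizes stated above (129 pairs: 60 for the representation dataset, 30 for training, and the remaining 39 for testing). The function name and the fixed seed are hypothetical; the source does not describe how the random selection is implemented.

```python
import random


def partition(n_pairs, n_rep, n_train, seed=0):
    """Randomly split pair indices into representation, training, and test sets."""
    rng = random.Random(seed)      # fixed seed only to make the sketch repeatable
    indices = list(range(n_pairs))
    rng.shuffle(indices)
    rep = indices[:n_rep]
    train = indices[n_rep:n_rep + n_train]
    test = indices[n_rep + n_train:]
    return rep, train, test


rep, train, test = partition(129, 60, 30)
print(len(rep), len(train), len(test))  # 60 30 39
```

The same function reproduces the split sizes of the CASIA NIR-VIS 2.0 experiment with one pair per subject: `partition(725, 100, 417)` leaves 208 pairs for testing.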

5 Conclusions 

A graphical representation based heterogeneous face
recognition method (G-HFR) is proposed in this paper.
G-HFR employs Markov networks to represent
heterogeneous face images with the spatial information
taken into consideration. Considering the coupled spatial
property between heterogeneous face image patches, we
propose a coupled representation similarity metric.
Experiments are conducted to illustrate the effect of the
proposed graphical representation and similarity metric in
comparison to commonly used representations and
similarity metrics. Compared with state-of-the-art methods
on four heterogeneous face recognition scenarios (viewed
sketch, forensic sketch, near infrared image, and thermal
infrared image), G-HFR achieves superior performance
in terms of face recognition accuracy. The key benefit
of the proposed G-HFR method is that it exploits the
spatial information, which is crucial for face recognition,
by employing Markov networks to represent heterogeneous
face images separately. The proposed graphical
representation can also be applied to other fields, such as
standard face recognition and facial expression
recognition. In the future, the effect of more types of
features will be investigated to further improve the
recognition performance on each of the HFR scenarios
separately. Furthermore, we will evaluate the performance
of the proposed G-HFR method on more heterogeneous face
recognition scenarios.